Text Normalization for the Pronunciation of Non-standard Words in an Inflected Language

Authors

  • Gerasimos Xydas
  • Georgios Karberis
  • Georgios Kouroupetroglou
Abstract

In this paper we present a novel approach, called “Text to Pronunciation (TtP)”, for the proper normalization of Non-Standard Words (NSWs) in unrestricted texts. The methodology handles inflection so that expanded NSWs remain consistent with the syntactic structure of the utterances they belong to. Moreover, to achieve an augmented auditory representation of NSWs in Text-to-Speech (TtS) systems, we couple the standard normalizer with: i) a language generator that compiles pronunciation formats and ii) VoiceXML attributes that guide the underlying TtS to imitate the human speaking style in the case of numbers. To evaluate the model for the Greek language, we used a 158K-word corpus with 4499 numerical expressions. We achieved an internal error rate of 7.67%; however, only 1.02% were perceivable errors, due to the nature of the language.
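To make the inflection issue concrete, the following is a minimal sketch and not the TtP system itself. It assumes only a well-known property of Greek: the cardinals 1, 3 and 4 change form with the grammatical gender of the noun they modify, so a digit string cannot be expanded without syntactic context. The function name `expand_cardinal`, the lookup tables, the restriction to the nominative case and the range 1 to 10 are all illustrative assumptions.

```python
# Toy illustration only: a hypothetical lookup, not the authors' normalizer.
# Nominative case, numbers 1-10. Genders: "m" masculine, "f" feminine, "n" neuter.

# Cardinals whose written form depends on the gender of the noun they modify.
INFLECTED = {
    1: {"m": "ένας", "f": "μία", "n": "ένα"},
    3: {"m": "τρεις", "f": "τρεις", "n": "τρία"},
    4: {"m": "τέσσερις", "f": "τέσσερις", "n": "τέσσερα"},
}

# Cardinals with a single form regardless of gender.
INVARIANT = {
    2: "δύο", 5: "πέντε", 6: "έξι", 7: "επτά",
    8: "οκτώ", 9: "εννέα", 10: "δέκα",
}


def expand_cardinal(n: int, gender: str) -> str:
    """Return the Greek word for n (1-10) agreeing with the given gender."""
    if gender not in ("m", "f", "n"):
        raise ValueError(f"unknown gender: {gender!r}")
    if n in INFLECTED:
        return INFLECTED[n][gender]
    if n in INVARIANT:
        return INVARIANT[n]
    raise ValueError(f"number out of range for this sketch: {n}")
```

With this sketch, the digit "3" expands to "τρεις" before a feminine noun such as "ώρες" (hours) but to "τρία" before a neuter noun such as "παιδιά" (children). A real normalizer would also need case agreement and a way to determine the gender of the governing noun from the sentence.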


Similar resources

Normalization of non-standard words

In addition to ordinary words and names, real text contains non-standard “words” (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary “letter-to-sound” rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with resp...


SSML Goes International – A Standard Story

Since September 2004, the SSML 1.0 [1] specification has been a W3C Recommendation. SSML is the standard way that a Voice Browser controls a speech synthesis engine. Given that it is a standard, actions to define the language of the text to be rendered, to change between several voices, to insert pauses, to perform simple text normalization (e.g. acronym expansions, such as reading W3C as “World ...


An Account of Iranian EFL Pronunciation Errors through L1 Transfer

In light of the fact that L2 pronunciation errors are often caused by the transfer of well-established L1 sound systems, this paper examines some of the outstanding phonological differences between Persian and English. Comparing segmental and supra-segmental aspects of both languages, this study also discusses several problematic areas of pronunciation facing Iranian learners of English. To rea...


Normalization of Non-Standard Words in Croatian Texts

This paper presents text normalization, which is an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, like numbers, dates, times, abbreviations, acronyms and the most common symbols, in their fully expanded form. The whole taxonomy for classification of non-standard words in the Croatian language togethe...


Context Tailoring for Text Normalization

Language processing tools suffer from significant performance drops in the social media domain due to its continuously evolving language. Transforming non-standard words into their standard forms has been studied as a step towards proper processing of ill-formed texts. This work describes a normalization system that considers contextual and lexical similarities between standard and non-standard wor...




Publication date: 2004